Abstract

When training Support Vector Machines (SVMs) over non-separable data sets, one sets 
the threshold b using any dual cost coefficient that is strictly between the bounds of 0 and 
C. We show that there exist SVM training problems with dual optimal solutions with 
all coefficients at bounds, but that all such problems are degenerate in the sense that the 
“optimal separating hyperplane” is given by w = 0, and the resulting (degenerate) SVM 
will classify all future points identically (to the class that supplies more training data). We 
also derive necessary and sufficient conditions on the input data for this to occur. Finally, 
we show that an SVM training problem can always be made degenerate by the addition of 
a single data point belonging to a certain unbounded polyhedron, which we characterize in 
terms of its extreme points and rays. 
1 Introduction


We are given $\ell$ examples $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ for all $i$. The SVM
training problem is to find a hyperplane and threshold $(w, b)$ that separates the positive and
negative examples with maximum margin, penalizing misclassifications linearly in a user-selected
penalty parameter $C > 0$ (see footnote 1). This formulation was introduced in [2]. For a good introduction to
SVMs and the nonlinear programming problems involved in their training, see [3] or [1]. We
train an SVM by solving either of the following pair of dual quadratic programs:

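$$
\begin{array}{llcll}
(P) & \displaystyle \min_{w,\, b,\, \Xi} \; \tfrac{1}{2}\|w\|^2 + C \Big( \sum_{i=1}^{\ell} \xi_i \Big) & \qquad & (D) & \displaystyle \max_{\Lambda} \; \Lambda \cdot \mathbf{1} - \tfrac{1}{2} \Lambda D \Lambda \\[4pt]
 & y_i (w \cdot x_i + b) \ge 1 - \xi_i & & & \Lambda \cdot y = 0 \\
 & \xi_i \ge 0 & & & \lambda_i \le C \\
 & & & & \lambda_i \ge 0
\end{array}
$$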

$D$ is the symmetric positive semidefinite matrix defined by $D_{ij} = y_i y_j \, x_i \cdot x_j$. Throughout
this note, we use the convention that if an equation contains $i$ as an unsummed subscript, the
corresponding equation is replicated for all $i \in \{1, \ldots, \ell\}$.

In practice, the dual program is solved (see footnote 2). However, for this pair of primal-dual problems, the KKT
conditions are necessary and sufficient to characterize optimal solutions. Therefore, $w$, $b$, $\Xi$, and
$\Lambda$ represent a pair of primal and dual optimal solutions if and only if they satisfy the KKT
conditions. Additionally, any primal and dual feasible solutions with identical objective values
are primal and dual optimal. The KKT conditions (for the primal problem) are as follows:


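\begin{align}
w - \sum_{i=1}^{\ell} \lambda_i y_i x_i &= 0 \tag{1} \\
\sum_{i=1}^{\ell} \lambda_i y_i &= 0 \tag{2} \\
C - \lambda_i - \mu_i &= 0 \tag{3} \\
y_i (x_i \cdot w + b) - 1 + \xi_i &\ge 0 \tag{4} \\
\lambda_i \{ y_i (x_i \cdot w + b) - 1 + \xi_i \} &= 0 \tag{5} \\
\mu_i \xi_i &= 0 \tag{6} \\
\xi_i, \lambda_i, \mu_i &\ge 0 \tag{7}
\end{align}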


The $\mu_i$ are Lagrange multipliers associated with the $\xi_i$; they do not appear explicitly in either (P)
or (D). The KKT conditions will be our major tool for investigating the properties of solutions
to (P) and (D).

Suppose that we have solved (D) and possess a dual optimal solution $\Lambda$. Equation (1) allows us
to determine $w$ for the associated primal optimal solution. Further suppose that there exists an
$i$ such that $0 < \lambda_i < C$. Then, by equation (3), $\mu_i > 0$, and by equation (6), $\xi_i = 0$. Because
$\lambda_i \ne 0$, equation (5) tells us that $y_i(x_i \cdot w + b) - 1 + \xi_i = 0$. Using $\xi_i = 0$ and $y_i^2 = 1$, we see that we can
determine the threshold $b$ using the equation $b = y_i - x_i \cdot w$.

1 Actually, we penalize linearly points for which $y_i(w \cdot x_i + b) < 1$; such points are not actually “misclassifications”
unless $y_i(w \cdot x_i + b) < 0$.

2 SVMs in general use a nonlinear kernel mapping. In this note, we explore the linear simplification in order 
to gain insight into SVM behavior. Our analysis holds identically in the nonlinear case. 
Once $b$ is known, we can determine the $\xi_i$ by noting that $\xi_i = 0$ if $\lambda_i \ne C$ (by equations (3) and
(6)), and that $\xi_i = 1 - y_i(x_i \cdot w + b)$ otherwise (by equation (5)). However, this is not strictly
necessary, as it is $w$ and $b$ that must be known in order to classify future instances.
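The recovery of $(w, b, \Xi)$ from a dual optimal $\Lambda$ described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `recover_primal`, `dot`, and `tol`, and the two-point toy problem, are introduced here. The threshold is computed as $b = y_i - x_i \cdot w$, which follows from equation (5) with $\xi_i = 0$ because $y_i^2 = 1$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * c for a, c in zip(u, v))


def recover_primal(X, y, lam, C, tol=1e-9):
    """Recover (w, b, xi) from a dual optimal solution lam.

    Raises ValueError when every lam[i] sits at a bound (0 or C),
    which is the situation studied in this note.
    """
    n_features = len(X[0])
    # Equation (1): w = sum_i lam_i * y_i * x_i
    w = [sum(l * yi * xi[d] for l, yi, xi in zip(lam, y, X))
         for d in range(n_features)]

    # Any index with 0 < lam_i < C has xi_i = 0 and a tight margin.
    free = [i for i, l in enumerate(lam) if tol < l < C - tol]
    if not free:
        raise ValueError("no coefficient strictly between 0 and C")
    i = free[0]
    b = y[i] - dot(X[i], w)

    # At an optimum this agrees with the rule in the text:
    # xi_i = 0 unless the margin constraint is violated.
    xi = [max(0.0, 1.0 - yi * (dot(x, w) + b)) for x, yi in zip(X, y)]
    return w, b, xi


# Toy problem (assumed for illustration): two points on the first axis.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
lam = [0.5, 0.5]  # dual optimum of this toy problem for any C > 0.5
w, b, xi = recover_primal(X, y, lam, C=10.0)
```

When all coefficients are at bounds, the function raises instead of guessing a threshold; the rest of this note explains what that case means.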

We note that our ability to determine $b$ and $\Xi$ is crucially dependent on the existence of a
$\lambda_i$ strictly between 0 and $C$. Additionally, the optimality conditions, and therefore the SVM
training algorithm derived in Osuna’s thesis [3], depend on the existence of such a $\lambda_i$ as well.
On page 49 of his thesis Osuna states that “We have not found a proof yet of the existence of
such $\lambda_i$, or conditions under which it does not exist.” Other discussions of SVMs ([1], [2]) also
implicitly assume the existence of such a $\lambda_i$.

In this paper, we show that there need not exist a $\lambda_i$ strictly between bounds. Such cases are
a subset of degenerate SVM training problems: those problems where the optimal separating
“hyperplane” is $w = 0$, and the optimal solution is to assign all future points to the same class.
We derive a strong characterization of SVM degeneracy in terms of conditions on the input data.
We go on to show that any SVM training problem can be made degenerate via the addition
of a single training point, and that, assuming the two classes are of different cardinalities, this
new training point can fall anywhere in a certain unbounded polyhedron. We provide a strong
characterization of this polyhedron, and give a mild condition which will ensure non-degeneracy.

2 Support Vector Machine Degeneracy 

In this section, we explore SVM training problems with a dual optimal solution satisfying $\lambda_i \in \{0, C\}$
for all $i$.

We begin by noting and dismissing the trivial example where all training points belong to the
same class, say class 1. In this case, it is easily seen that $\Lambda = 0$, $\Xi = 0$, $w = 0$, and $b = 1$
represent primal and dual optimal solutions, both with objective value 0.

Definition 1 A vector $\Lambda$ is a $\{0, C\}$-solution for an SVM training problem $\mathcal{P}$ if $\Lambda$ solves (D),
$\lambda_i \in \{0, C\}$ for all $i$ and $\Lambda \ne 0$ (note that this includes cases where $\lambda_i = C$ for all $i$).

We demonstrate the existence of problems having $\{0, C\}$-solutions with an example where the
data lie in $\mathbb{R}^2$:

    x        y
    (2,3)    1
    (2,2)   -1
    (1,2)    1
    (1,3)   -1

$$
D = \begin{pmatrix}
13 & -10 & 8 & -11 \\
-10 & 8 & -6 & 8 \\
8 & -6 & 5 & -7 \\
-11 & 8 & -7 & 10
\end{pmatrix}
$$

Suppose $C = 10$. The reader may easily verify that $\Lambda = (10, 10, 10, 10)$, $w = 0$, $b = 1$,
$\Xi = (0, 2, 0, 2)$ are feasible primal and dual solutions, both with objective value 40, and are
therefore optimal. Actually, given our choice of $\Lambda$ and $w$, we may set $b$ anywhere in the closed
interval $[-1, 1]$, and set $\Xi = (1 - b, 1 + b, 1 - b, 1 + b)$.

We have demonstrated the possibility of $\{0, C\}$-solutions, but the above example seems highly
abnormal. The data are distributed at the four corners of a unit square centered at (1.5, 2.5), with
opposite corners being of the same class. The “optimal separating hyperplane” is $w = 0$, which
is not a hyperplane at all. We now proceed to formally show that all SVM training problems
which admit $\{0, C\}$-solutions are degenerate in this sense.

The following lemma is obvious from inspection of the KKT conditions:
Lemma 2 Suppose that $\Lambda$ is a $\{0, C\}$-solution to an SVM training problem $\mathcal{P}_1$ with $C = C_1$.
Given a new SVM training problem $\mathcal{P}_2$ with identical input data and $C = C_2$, $(C_2 / C_1) \cdot \Lambda$ is dual
optimal for $\mathcal{P}_2$. The corresponding primal optimal solution(s) is (are) unchanged.

We see that $\{0, C\}$-solutions are not dependent on a particular choice of $C$. This in turn implies
the following:

Lemma 3 If $\Lambda$ is a $\{0, C\}$-solution to an SVM training problem $\mathcal{P}$, $D \cdot \Lambda = 0$.

PROOF: Since $D$ is symmetric positive semidefinite, we can write $D = R \Sigma R^T$, where $\Sigma$ is a
diagonal matrix with the (nonnegative) eigenvalues $\sigma_j$ of $D$ in descending order on the diagonal, $R$
is an orthogonal basis of corresponding eigenvectors $R_j$ of $D$, and $R R^T = I$. If $D \cdot \Lambda \ne 0$, then for
some index $k$, $\sigma_k > 0$ and $R_k \cdot \Lambda \ne 0$.

For any value of $C$, let $\Lambda_C$ be the $\{0, C\}$-solution obtained by adjusting $\Lambda$ appropriately (so that $\Lambda_C = C \Lambda_1$). This
solution is dual optimal for a problem having input data identical to $\mathcal{P}$, with a new value of $C$,
by Lemma 2.


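\begin{align*}
\Lambda_C D \Lambda_C &= \sum_{j=1}^{\ell} \sigma_j \| R_j \cdot \Lambda_C \|^2 \\
&\ge \sigma_k \| R_k \cdot \Lambda_C \|^2 \\
&= \sigma_k C^2 \| R_k \cdot \Lambda_1 \|^2
\end{align*}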


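Define $S$ to be the number of non-zero elements in $\Lambda$. As we vary $C$, the optimal dual objective
value of our family of $\{0, C\}$-solutions is given by:

\begin{align*}
f_\Lambda(C) &= \Lambda_C \cdot \mathbf{1} - \tfrac{1}{2} \Lambda_C D \Lambda_C \\
&\le S C - \tfrac{1}{2} \sigma_k C^2 \| R_k \cdot \Lambda_1 \|^2
\end{align*}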


However, if

$$ C^* > \frac{2S}{\sigma_k \| R_k \cdot \Lambda_1 \|^2}, $$

then $f_\Lambda(C^*) < 0$. This is a contradiction, for $\Lambda = 0$ is feasible in $\mathcal{P}$ with objective value zero, and
zero is therefore a lower bound on the value of any optimal solution to $\mathcal{P}$, regardless of the value
of $C$. □
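As a numerical sanity check of Lemmas 2 and 3, the four-point example of this section can be evaluated directly. The sketch below is plain Python and only restates the data given in the text; the helper names `dot` and `scaled_dual_value` are assumptions made for this illustration.

```python
# Training data and labels of the four-point example in this section.
X = [(2, 3), (2, 2), (1, 2), (1, 3)]
y = [1, -1, 1, -1]
C = 10
lam = [C, C, C, C]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * c for a, c in zip(u, v))


# D_ij = y_i y_j (x_i . x_j)
D = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(4)] for i in range(4)]

# Lemma 3: D . Lambda vanishes for a {0, C}-solution.
D_lam = [dot(row, lam) for row in D]

# Dual objective Lambda . 1 - (1/2) Lambda D Lambda.
dual_value = sum(lam) - 0.5 * dot(lam, D_lam)

# Equation (1): w = sum_i lambda_i y_i x_i.
w = [sum(l * yi * xi[d] for l, yi, xi in zip(lam, y, X)) for d in range(2)]


def scaled_dual_value(C_new):
    """Dual objective of (C_new / C) * Lambda, the rescaling used in Lemma 2."""
    lam_new = [C_new / C * l for l in lam]
    return sum(lam_new) - 0.5 * dot(lam_new, [dot(row, lam_new) for row in D])
```

Here `D_lam` is the zero vector, `w` is zero, and the dual objective grows linearly in $C$, which is exactly the behavior the proof of Lemma 3 relies on.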

Theorem 4 If $\Lambda$ is a $\{0, C\}$-solution to an SVM training problem $\mathcal{P}$, $w = 0$ in all primal
optimal solutions.

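Proof:

Any optimal solution must, along with $\Lambda$, satisfy the KKT conditions. Exploiting this, we see:

\begin{align*}
D \Lambda = 0 \;&\Longrightarrow\; \Lambda D \Lambda = 0 \\
&\Longrightarrow\; \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \lambda_i y_i y_j \, (x_i \cdot x_j) \, \lambda_j = 0 \\
&\Longrightarrow\; \Big( \sum_{i=1}^{\ell} \lambda_i y_i x_i \Big) \cdot \Big( \sum_{j=1}^{\ell} \lambda_j y_j x_j \Big) = 0 \\
&\Longrightarrow\; w \cdot w = 0 \\
&\Longrightarrow\; w = 0 \qquad \Box
\end{align*}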


This is a key result. It states that if our dual problem admits a $\{0, C\}$-solution, the “optimal
separating hyperplane” is $w = 0$. In other words, it is of no value to construct a hyperplane
at all, no matter how expensive misclassifications are, and the optimal classifier will classify all
future data points using only the threshold $b$. Our data must be arranged in such a way that
we may as well “de-metrize” our space by throwing away all information about where our data
points are located, and classify all points identically.

The converse of this statement is false: given an SVM training problem $\mathcal{P}$ that admits a primal
solution with $w = 0$, it is not necessarily the case that all dual optimal solutions are $\{0, C\}$-solutions,
nor even that a $\{0, C\}$-solution necessarily exists, as the following example, constructed
from the first example by “splitting” a data point into two new points whose average is one of
the original points, shows:

$$
D = \begin{pmatrix}
13 & -10 & 6.5 & 9.5 & -11 \\
-10 & 8 & -5 & -7 & 8 \\
6.5 & -5 & 3.25 & 4.75 & -5.5 \\
9.5 & -7 & 4.75 & 7.25 & -8.5 \\
-11 & 8 & -5.5 & -8.5 & 10
\end{pmatrix}
$$

Again letting $C = 10$, the reader may verify that setting $\Lambda = (10, 10, 5, 5, 10)$, $w = 0$, $b = 1$,
$\Xi = (0, 2, 0, 0, 2)$ gives feasible primal and dual solutions, both with objective value 40, which
are therefore optimal. With more effort, the reader may verify that $\Lambda = (10, 10, 5, 5, 10)$ is the
unique optimal solution to the dual problem, and therefore no $\{0, C\}$-solution exists.

Although our initial motivation was to study problems with optimal solutions having every dual
coefficient $\lambda_i$ at bounds, we gain additional insight by studying the following, broader class of
problems.

Definition 5 An SVM training problem $\mathcal{P}$ is degenerate if there exists an optimal primal
solution to $\mathcal{P}$ in which $w = 0$.

By Theorem 4, any problem that admits a $\{0, C\}$-solution is degenerate. As in the $\{0, C\}$-solution
case, one can use the KKT conditions to easily show that the degeneracy of an SVM
training problem is independent of the particular choice of the parameter $C$, and that $w = 0$ in
all primal optimal solutions of a degenerate training problem.

For degenerate SVM training problems, even though there is no optimal separating hyperplane
in the normal sense, we still call those data points that contribute to the “expansion” $w = 0$
with $\lambda_i \ne 0$ support vectors. Given an SVM training problem $\mathcal{P}$, define $K_i$ to be the index set
of points in class $i$, $i \in \{1, -1\}$.

Lemma 6 Given a degenerate SVM training problem $\mathcal{P}$, assume without loss of generality that
$|K_{-1}| \le |K_1|$. Then all points in class $-1$ are support vectors; furthermore, $\lambda_i = C$ if $i \in K_{-1}$.
Additionally, if $|K_{-1}| = |K_1|$, the (unique) dual optimal solution is $\Lambda = C$ (that is, $\lambda_i = C$ for all $i$).

PROOF: Because $w = 0$, the primal constraints reduce to:

$$ y_i b \ge 1 - \xi_i $$

If $|K_{-1}| < |K_1|$, the optimal value of $b$ is 1, and $\xi_i$ is positive for $i \in K_{-1}$. Therefore, $\lambda_i = C$ for
$i \in K_{-1}$ (by Equations 6 and 3).

Assume $|K_{-1}| = |K_1|$. We may (optimally) choose $b$ anywhere in the range $[-1, 1]$. If $b < 0$, all
points in class 1 have $\lambda_i = C$, and if $b > 0$, all points in class $-1$ have $\lambda_i = C$. In either case,
there are at least $|K_{-1}|$ points in a single class satisfying $\lambda_i = C$. But equation (2) says that the
sum of the $\lambda_i$ for each class must be equal, and since no $\lambda_i$ may be greater than $C$, we conclude
that every $\lambda_i$ is equal to $C$ in both classes. □
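The choice of threshold in the proof can be illustrated numerically. With $w = 0$ the primal objective depends only on $b$ and on the class sizes, since the smallest feasible slack is $\xi_i = \max(0, 1 - y_i b)$. The sketch below is an illustration under assumptions made here: the function names, the grid over $[-2, 2]$, and the tolerance are not part of the text.

```python
def degenerate_objective(b, n_pos, n_neg, C=1.0):
    """Primal objective at w = 0: C * sum_i max(0, 1 - y_i * b)."""
    return C * (n_pos * max(0.0, 1.0 - b) + n_neg * max(0.0, 1.0 + b))


def best_threshold(n_pos, n_neg, C=1.0, steps=400, tol=1e-9):
    """Grid search over b in [-2, 2].

    Returns the smallest and largest grid point attaining the minimum,
    so a unique minimiser appears as a pair of equal values.
    """
    grid = [-2.0 + 4.0 * k / steps for k in range(steps + 1)]
    values = [degenerate_objective(b, n_pos, n_neg, C) for b in grid]
    best = min(values)
    minimisers = [b for b, v in zip(grid, values) if v <= best + tol]
    return minimisers[0], minimisers[-1]
```

For unequal class sizes the minimiser is the single point $b = 1$ or $b = -1$, on the side of the larger class; for equal sizes every $b$ in $[-1, 1]$ is optimal, matching the two cases of the proof.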

Finally, we derive conditions on the input data for a degenerate SVM training problem $\mathcal{P}$.

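Theorem 7 Given an SVM training problem $\mathcal{P}$, assume without loss of generality that $|K_{-1}| \le |K_1|$.
Then:

a. $\mathcal{P}$ is degenerate if and only if there exists a set of multipliers $\omega_i$ for the points in $K_1$ satisfying:

\begin{align*}
0 \le \omega_i &\le 1 \\
\sum_{i \in K_{-1}} x_i &= \sum_{i \in K_1} \omega_i x_i \\
\sum_{i \in K_1} \omega_i &= |K_{-1}|
\end{align*}

b. $\mathcal{P}$ admits a $\{0, C\}$-solution if and only if $\mathcal{P}$ is degenerate and the $\omega_i$ in part (a) may all be
chosen to be 0 or 1.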

Proof:

(a, $\Rightarrow$) Suppose $\mathcal{P}$ is degenerate. Consider a modification of $\mathcal{P}$ with identical input data, but
$C = 1$; this problem is also degenerate. All points in class $-1$ are support vectors, and their
associated $\lambda_i$ are at 1, by Lemma 6. Letting $\Lambda$ be any dual optimal solution to this modified problem, we see that
letting $\omega_i = \lambda_i$ for $i \in K_1$ and applying Equation (1) (with $w = 0$) and Equation (2) demonstrates the existence of the $\omega_i$.

(a, $\Leftarrow$) Given $\omega_i$ satisfying the condition, we easily see that $\lambda_i = C$ for $i \in K_{-1}$, $\lambda_i = \omega_i C$ for
$i \in K_1$ induces a pair of optimal primal and dual solutions to $\mathcal{P}$ with $w = 0$ using the KKT
conditions.

(b, $\Rightarrow$) Given a $\{0, C\}$-solution, $w = 0$ in an associated primal solution by Theorem 4, and
setting $\omega_i = \lambda_i / C$ for $i \in K_1$ satisfies the requirements on $\omega_i$.

(b, $\Leftarrow$) Let $\lambda_i = \omega_i C$ for $i \in K_1$, and apply the KKT conditions. □
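The conditions of Theorem 7(a) are straightforward to check for a candidate set of multipliers. The following plain-Python sketch does only that check (it does not search for multipliers). The function name and tolerance are assumptions, and the coordinates used for the five-point example in the usage note are inferred here from the matrix $D$ given for that example rather than stated in the text.

```python
def satisfies_theorem7(K_minus, K_plus, omega, tol=1e-9):
    """Check the three conditions of Theorem 7(a).

    K_minus: points of the smaller class, K_plus: points of the larger
    class, omega: one candidate multiplier per point of K_plus.
    """
    n = len(K_minus)
    dim = len(K_minus[0])
    # 0 <= omega_i <= 1
    if any(w < -tol or w > 1 + tol for w in omega):
        return False
    # sum of omega_i equals |K_{-1}|
    if abs(sum(omega) - n) > tol:
        return False
    # sum over K_{-1} of x_i equals the omega-weighted sum over K_1
    for d in range(dim):
        lhs = sum(x[d] for x in K_minus)
        rhs = sum(w * x[d] for w, x in zip(omega, K_plus))
        if abs(lhs - rhs) > tol:
            return False
    return True
```

For the four-point example, `satisfies_theorem7([(2, 2), (1, 3)], [(2, 3), (1, 2)], [1, 1])` holds with multipliers in $\{0, 1\}$, consistent with part (b). For the split example, taking the class 1 points as (2,3), (1,1.5), (1,2.5) (an inferred assumption), the multipliers (1, 0.5, 0.5) pass the check, consistent with $\Lambda = (10, 10, 5, 5, 10)$ at $C = 10$.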
3 The Degenerating Polyhedron

Theorem 7 indicates that it is always possible to make an SVM training problem degenerate by
adding a single new data point. We now proceed to characterize the set of individual points
whose addition will make a given problem degenerate. For the remainder of this section, we
assume that $|K_{-1}| \le |K_1|$, and we denote $\sum_{i \in K_{-1}} x_i$ by $V$, and $|K_{-1}|$ by $n$.

Suppose we choose, for each $i \in K_1$, an $\omega_i \in [0, 1]$, satisfying $n - 1 \le \sum_{i \in K_1} \omega_i < n$. It is clear
from the conditions of Theorem 7 that if we add a new data point

$$ x_c = \frac{V - \sum_{i \in K_1} \omega_i x_i}{n - \sum_{i \in K_1} \omega_i} $$

then the problem becomes degenerate, where the new point has a multiplier given by $\omega_c = n - \sum_{i \in K_1} \omega_i$,
and that all single points whose additions would make the problem degenerate can
be found in such a manner. We denote the set of points so obtained by $X_D$.
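The construction of the new point can be sketched as follows. This is an illustration only; the function name `degenerating_point` and the toy data are assumptions made here, and the new point is taken to join the larger class.

```python
def degenerating_point(K_minus, K_plus, omega):
    """Return (x_c, omega_c) for multipliers omega on the larger class.

    Requires 0 <= omega_i <= 1 and n - 1 <= sum(omega) < n,
    where n is the size of the smaller class K_minus.
    """
    n = len(K_minus)
    total = sum(omega)
    if any(w < 0 or w > 1 for w in omega) or not (n - 1 <= total < n):
        raise ValueError("omega violates the conditions of the construction")
    omega_c = n - total
    dim = len(K_minus[0])
    # V is the sum of the points in the smaller class.
    V = [sum(x[d] for x in K_minus) for d in range(dim)]
    x_c = [(V[d] - sum(w * x[d] for w, x in zip(omega, K_plus))) / omega_c
           for d in range(dim)]
    return x_c, omega_c


# Toy problem (assumed): one point in the smaller class, two in the larger.
K_minus = [(0.0, 0.0)]
K_plus = [(1.0, 0.0), (2.0, 0.0)]
x_c, omega_c = degenerating_point(K_minus, K_plus, [0.5, 0.0])
```

In this toy case the new point lies at $(-1, 0)$, on the far side of the smaller class from the larger one, and the multipliers $(0.5, 0, 0.5)$ on the enlarged class satisfy the conditions of Theorem 7.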

We introduce the following notation. For $k \le n$, we let $S_k$ denote the set containing all possible
sums of $k$ points in $K_1$. Given a point $s \in S_k$, we define an indicator function $\chi_s : K_1 \to \{0, 1\}$
with the property $\chi_s(x_i) = 1$ if and only if $x_i$ is one of the $k$ points of $K_1$ that were summed to
make $s$.

The region $X_D$ is in fact a polyhedron whose extreme points and extreme rays are of the form
$V - s$ for $s \in S_{n-1}$ and $s \in S_n$, respectively. More specifically, we have the following theorem; the
proof is not difficult, but it is rather technical, and we defer it to Appendix A:

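Theorem 8 Given a non-degenerate problem $\mathcal{P}$, consider the polyhedron

$$ P_D = \Big\{ \sum_{s_p \in S_{n-1}} \lambda_{s_p} (V - s_p) + \sum_{s_r \in S_n} \alpha_{s_r} (V - s_r) \;\Big|\; \lambda_{s_p}, \alpha_{s_r} \ge 0, \; \sum_{s_p \in S_{n-1}} \lambda_{s_p} = 1 \Big\} $$

Then $P_D = X_D$.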

An example is shown in Figure 1. On the one hand, the idea that the addition of a single data 
point can make an SVM training problem degenerate seems to bode ill for the usefulness of the 
method. Indeed, SVMs are in some sense not robust. This is a consequence of the fact that 
because errors are penalized in the $L_1$ norm, a single outlier can have arbitrarily large effects
on the separating hyperplane. However, the fact that we are able to precisely characterize the 
“degenerating” polyhedron allows us to provide a positive result as well. We begin by noting that 
in the example of Figure 1, the entire polyhedron of points whose addition make the problem 
degenerate is located well away from the initial data. This is not a coincidence. Indeed, using 
Theorem 8, we may easily derive the following theorem: 

Theorem 9 Given a non-degenerate problem $\mathcal{P}$ with $|K_{-1}| \le |K_1|$, suppose there exists a hyperplane
$w$ through $V/n$, the center of mass of $K_{-1}$, such that all points in $K_1$ lie on one side of $w$,
and the closest distance between a point in $K_1$ and $w$ is $d$. Then all points in the “degenerating”
polyhedron $P_D$ lie at least $(|K_{-1}| - 1) \cdot d$ from $w$ on the other side of $w$ from $K_1$.
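One way to see where this bound comes from is the following sketch. It uses notation introduced only for this illustration: the hyperplane is written as $\{x : u \cdot x = c\}$ with unit normal $u$, so that $u \cdot V = nc$ (the hyperplane passes through $V/n$) and $u \cdot x_i \ge c + d$ for every $i \in K_1$. Evaluating $u$ on the generators of the polyhedron of Theorem 8 gives:

```latex
\begin{align*}
u \cdot (V - s_p) &= nc - \sum_{i \,:\, \chi_{s_p}(x_i) = 1} u \cdot x_i
   \;\le\; nc - (n-1)(c + d) \;=\; c - (n-1)\,d,
   && s_p \in S_{n-1}, \\
u \cdot (V - s_r) &= nc - \sum_{i \,:\, \chi_{s_r}(x_i) = 1} u \cdot x_i
   \;\le\; nc - n(c + d) \;=\; -nd \;\le\; 0,
   && s_r \in S_n .
\end{align*}
```

Every point of the polyhedron is a convex combination of the points $V - s_p$ plus a nonnegative combination of the directions $V - s_r$, so it satisfies $u \cdot x \le c - (n-1)d$; that is, it lies at distance at least $(n-1)d$ from the hyperplane, on the side opposite $K_1$.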

Using Theorem 7 we can easily show that if the center of mass of the points in the smaller
class ($V/n$) does not lie in the convex hull of the points in the larger class, our problem is not
degenerate, and we may apply Theorem 9 to bound below the distance at which an outlier would
have to lie from $V/n$ in order to make the problem degenerate. We conclude that if the class
with larger cardinality lies well away from and entirely to one side of a hyperplane through the
center of mass of the class of smaller cardinality, our problem is nondegenerate, and any single
point we could add to make the problem degenerate would be an extreme outlier, lying on the
opposite side of the smaller class from the larger class.

Figure 1: A sample problem, and the “degenerating” polyhedron

4 Nonlinear SVMs and Further Remarks 

The conditions we have derived so far apply to the construction of a linear decision surface. 
It should be clear that similar arguments apply to nonlinear kernels. In particular, degenerate 
SVMs will occur if and only if the data satisfy the conditions of Theorem 7 after undergoing the 
nonlinear mapping to the high-dimensional space. It is not necessary that the data be degenerate 
in the original input space, although examples could be derived where they were degenerate in 
both spaces, for a particular kernel choice. The important message of Theorem 7, however, is 
that while degenerate SVMs are possible, the requirements on the input data are so stringent that
one should never expect to encounter them in practice. On another note, if a degenerate SVM 
does occur, one simply sets the threshold b to 1 or —1, depending on which class contributes more 
points to the training set. Thus in all cases, we are able to determine the threshold b. Of course, 
the wisdom of this approach depends on the data distribution. If our two classes lie largely on 
top of each other, then classifying according to the larger class may indeed be the best we can
do (assuming our examples were drawn randomly from the input distribution). If, instead, our
dataset looks more like that of Figure 1, we are better off removing outliers and re-solving.

Finally, a brief remark on complexity is in order. The quadratic program (D) can be solved
in polynomial time, and solving this program will allow us to determine whether a given SVM
training problem $\mathcal{P}$ is degenerate. However, the problem of determining whether or not a $\{0, C\}$-solution
exists is not so easy. Certainly, if $\mathcal{P}$ is not degenerate, no $\{0, C\}$-solution exists, but
the converse is false. Determining the existence of a $\{0, C\}$-solution may be quite difficult: if
we require the $x_i$ to lie in $\mathbb{R}^1$, determining whether a $\{0, C\}$-solution exists is already equivalent
to solving the weakly NP-complete problem SUBSET-SUM (see [4] for more information on
NP-completeness, and footnote 3).
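By Theorem 7(b), a $\{0, C\}$-solution exists exactly when some subset of the larger class, of the same size as the smaller class, has the same vector sum as the smaller class. A brute-force search over subsets makes the connection to SUBSET-SUM concrete. The sketch below is an illustration only: the function name and tolerance are assumptions, and the running time is exponential in the number of points.

```python
from itertools import combinations


def has_zero_C_solution(K_minus, K_plus, tol=1e-9):
    """Brute-force test of Theorem 7(b).

    Assumes len(K_minus) <= len(K_plus). Returns True when some subset
    of K_plus with len(K_minus) elements sums to the sum of K_minus.
    """
    n = len(K_minus)
    dim = len(K_minus[0])
    target = [sum(x[d] for x in K_minus) for d in range(dim)]
    for subset in combinations(K_plus, n):
        if all(abs(sum(x[d] for x in subset) - target[d]) <= tol
               for d in range(dim)):
            return True
    return False
```

For one-dimensional data the points are 1-tuples, for example `has_zero_C_solution([(3.0,), (4.0,)], [(1.0,), (6.0,), (2.0,)])`, which asks whether two of the numbers 1, 6, 2 sum to 7.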
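A Proof of Theorem 8


Theorem 8 Given a non-degenerate problem $\mathcal{P}$, consider the polyhedron

$$ P_D = \Big\{ \sum_{s_p \in S_{n-1}} \lambda_{s_p} (V - s_p) + \sum_{s_r \in S_n} \alpha_{s_r} (V - s_r) \;\Big|\; \lambda_{s_p}, \alpha_{s_r} \ge 0, \; \sum_{s_p \in S_{n-1}} \lambda_{s_p} = 1 \Big\} $$

Then $P_D = X_D$.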

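Proof:

(a, $P_D \subseteq X_D$) Given a set of $\lambda_{s_p}$ and $\alpha_{s_r}$ satisfying $\lambda_{s_p}, \alpha_{s_r} \ge 0$, $\sum_{s_p} \lambda_{s_p} = 1$, we define $A = \sum_{s_r} \alpha_{s_r}$, and set

$$ \omega_c = \frac{1}{1 + A}, $$

and, for $i \in K_1$, we set

$$ \omega_i = \omega_c \Big( \sum_{s_p \in S_{n-1}} \lambda_{s_p} \chi_{s_p}(x_i) + \sum_{s_r \in S_n} \alpha_{s_r} \chi_{s_r}(x_i) \Big) $$

Then $0 \le \omega_i \le 1$ for each $i \in K_1$, and

$$ \sum_{i \in K_1} \omega_i = \frac{n - 1 + nA}{1 + A} = n - \frac{1}{1 + A}, $$

which is in $[n - 1, n)$, so we conclude that the assigned $\omega_i$ are valid. Finally, substituting into
Equation (8), we find: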


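\begin{align*}
\frac{V - \sum_{i \in K_1} \omega_i x_i}{n - \sum_{i \in K_1} \omega_i}
&= \frac{V - \frac{1}{1+A} \sum_{i \in K_1} \big( \sum_{s_p} \lambda_{s_p} \chi_{s_p}(x_i) + \sum_{s_r} \alpha_{s_r} \chi_{s_r}(x_i) \big) x_i}{\frac{1}{1+A}} \\
&= (1 + A) V - \sum_{s_p \in S_{n-1}} \lambda_{s_p} s_p - \sum_{s_r \in S_n} \alpha_{s_r} s_r \\
&= \sum_{s_p \in S_{n-1}} \lambda_{s_p} (V - s_p) + \sum_{s_r \in S_n} \alpha_{s_r} (V - s_r)
\end{align*}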

3 Because the problem is only weakly NP-complete, given a bound on the size of the numbers involved, the 
problem is polynomially solvable. 



We conclude that $P_D \subseteq X_D$.
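The mapping used in part (a) can be checked numerically on a small instance. The sketch below is an illustration under assumptions made here: subsets of $K_1$ are represented as tuples of indices, the weights are stored in dictionaries keyed by those tuples, and the function names and toy data are hypothetical. It computes the $\omega_i$ from the weights, then compares the point obtained from the $\omega_i$ with the point obtained directly from the weighted combination of the $V - s_p$ and $V - s_r$.

```python
def omega_from_weights(m, lam, alpha):
    """Multipliers of part (a) for a class K_1 with m points.

    lam maps (n-1)-subsets of range(m) to weights summing to 1;
    alpha maps n-subsets of range(m) to nonnegative weights.
    """
    A = sum(alpha.values())
    omega_c = 1.0 / (1.0 + A)
    omega = []
    for i in range(m):
        mass = sum(wgt for subset, wgt in lam.items() if i in subset)
        mass += sum(wgt for subset, wgt in alpha.items() if i in subset)
        omega.append(omega_c * mass)
    return omega, omega_c


def point_from_omega(V, K_plus, omega, omega_c):
    """(V - sum_i omega_i x_i) / omega_c, the new point of Section 3."""
    return [(V[d] - sum(w * x[d] for w, x in zip(omega, K_plus))) / omega_c
            for d in range(len(V))]


def point_from_weights(V, K_plus, lam, alpha):
    """sum lam * (V - s_p) + sum alpha * (V - s_r), a point of P_D."""
    out = [0.0] * len(V)
    for weights in (lam, alpha):
        for subset, wgt in weights.items():
            for d in range(len(V)):
                s_d = sum(K_plus[i][d] for i in subset)
                out[d] += wgt * (V[d] - s_d)
    return out


# Toy instance (assumed): n = 2, three points in the larger class.
K_plus = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
V = (0.0, -1.0)
lam = {(0,): 0.5, (2,): 0.5}
alpha = {(0, 1): 1.0}
omega, omega_c = omega_from_weights(len(K_plus), lam, alpha)
via_omega = point_from_omega(V, K_plus, omega, omega_c)
via_weights = point_from_weights(V, K_plus, lam, alpha)
```

Both routes give the same point, and the sum of the computed multipliers falls in $[n - 1, n)$ as the proof requires.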

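(b, $X_D \subseteq P_D$) Our proof is by construction: given a set of $\omega_i$, $i \in K_1$, we show how to choose
$\lambda_{s_p}$ and $\alpha_{s_r}$ so that:

\begin{align*}
\lambda_{s_p} &\ge 0 \quad \forall s_p \in S_{n-1} \\
\sum_{s_p \in S_{n-1}} \lambda_{s_p} &= 1 \\
\alpha_{s_r} &\ge 0 \quad \forall s_r \in S_n \\
\frac{V - \sum_{i \in K_1} \omega_i x_i}{n - \sum_{i \in K_1} \omega_i} &= \sum_{s_p \in S_{n-1}} \lambda_{s_p} (V - s_p) + \sum_{s_r \in S_n} \alpha_{s_r} (V - s_r)
\end{align*}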

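If we impose the reasonable “separability” conditions:

\begin{align*}
\frac{V}{n - \sum_{i \in K_1} \omega_i} &= \sum_{s_p \in S_{n-1}} \lambda_{s_p} V + \sum_{s_r \in S_n} \alpha_{s_r} V \\
\frac{\sum_{i \in K_1} \omega_i x_i}{n - \sum_{i \in K_1} \omega_i} &= \sum_{s_p \in S_{n-1}} \lambda_{s_p} s_p + \sum_{s_r \in S_n} \alpha_{s_r} s_r
\end{align*}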

we can easily derive the following:

$$ \sum_{s_r \in S_n} \alpha_{s_r} = \frac{\big( \sum_{i \in K_1} \omega_i + 1 \big) - n}{n - \sum_{i \in K_1} \omega_i} = A $$

ieKi 

We are now ready to describe the actual construction. We will first assign the $\alpha_{s_r}$, then the $\lambda_{s_p}$.
We describe in detail the assignment of the $\alpha_{s_r}$; the assignment of the $\lambda_{s_p}$ is essentially similar.
We begin by initializing each $\alpha_{s_r}$ to 0. At each step of the algorithm, we consider the “residual”:


$$ \frac{V - \sum_{i \in K_1} \omega_i x_i}{n - \sum_{i \in K_1} \omega_i} - \sum_{s_r \in S_n} \alpha_{s_r} (V - s_r) \tag{8} $$


Note that by expanding each $s_r$ in the $n$ points of $K_1$ which sum to it, we can represent (8)
as a multiple of $V$ minus a linear combination of the points of $K_1$; we will maintain the
invariant that this linear combination is actually a nonnegative combination. During a step of
the algorithm, we select the $n$ points of $K_1$ that have the largest coefficients in this expansion.
If there is a tie, we expand the set to include all points with coefficients equal to the $n$th largest
coefficient. Let $j$ be the number of points in the set that share the $n$th largest coefficient, and
let $k$ ($\ge n$) be the total size of the selected set. We select the $\binom{j}{n-k+j}$ points $s_r$ containing
the remaining $\max(k - j, 0)$ points with the largest coefficients, and $n - k + j$ of the $j$ points
which share the $n$th largest coefficient. We will then add equal amounts of each of these $s_r$
to our representation until some pair of coefficients in the residual that were unequal become
equal. This can happen in one of two ways: either the smallest of the coefficients in our set
can become equal to a new, still smaller coefficient, or the second smallest coefficient in the set
can become equal to the smallest (this can only happen in the case where $k > n$). At each step
of the algorithm, the total number of different coefficients in the residual is reduced by at least
one, so, within $|K_1|$ steps, we will be able to assign all the $\alpha_{s_r}$ (note that at each step of our
algorithm, we increase $\binom{j}{n-k+j}$ of the $\alpha_{s_r}$). The only way the algorithm could break down is
if, at some step, there were fewer than $n$ points in $K_1$ with nonzero coefficients in the residual.
Trivially, the algorithm does not break down at the first step: there must always be at least
$n$ points with non-zero coefficients initially. To show that the algorithm does not break down at
a later step, assume that after assigning coefficients to the $s_r$ totalling $k$ ($< A$), we are left with
$j$ ($< n$) non-zero coefficients. Noting that our algorithm requires that each of the $j$ remaining
points with non-zero coefficients is part of each $s_r$ with a non-zero coefficient, we can see that
the residual value of each of these $j$ points is no more than $\frac{1}{n - \sum_{i \in K_1} \omega_i} - k$. We derive the following
bound on the initial sum of the coefficients, which we call $I_{sum}$:


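\begin{align*}
I_{sum} &\le j \Big( \frac{1}{n - \sum_{i \in K_1} \omega_i} - k \Big) + kn \\
&= \frac{j}{n - \sum_{i \in K_1} \omega_i} + k(n - j) \\
&\le \frac{n - 1}{n - \sum_{i \in K_1} \omega_i} + k \\
&< \frac{n - 1}{n - \sum_{i \in K_1} \omega_i} + \frac{\sum_{i \in K_1} \omega_i + 1 - n}{n - \sum_{i \in K_1} \omega_i} \\
&= \frac{\sum_{i \in K_1} \omega_i}{n - \sum_{i \in K_1} \omega_i}
\end{align*}

But this is a contradiction: $I_{sum}$ must be equal to $\frac{\sum_{i \in K_1} \omega_i}{n - \sum_{i \in K_1} \omega_i}$. We conclude that we are able to
assign the $\alpha_{s_r}$ successfully. Extremely similar arguments hold
for the $\lambda_{s_p}$. □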